Recap
Interpretation
Interpreting simple scenarios:
Matrix equation fits mean of \(Y\) as a linear function of many variables:
\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]
\(\hat{Y_i}\) is the predicted mean of \(Y\) given values of \(\mathbf{X}_i\)
We obtain coefficients \(b_0, b_1, \dots, b_k = \beta\):
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]
This is the matrix formula for least squares regression.
If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)
If \(X\) is a column of \(1\)s and a column of \(X_1\), it is an intercept and slope on \(X_1\).
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
The mathematical procedures we use in regression ensure that:
\(1\). the mean of the residuals is always zero. Because we included an intercept (\(a\) or \(b_0\)), and the regression line goes through the point of averages, the mean of the residuals is always 0. \(\overline{e} = 0\). This is also true of residuals of the mean.
The mathematical procedures we use in regression ensure that:
\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.
The mathematical procedures we use in regression ensure that:
\(3\). \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their mean is the mean of \(Y\).
The mathematical procedures we use in regression ensure that:
\(4\). Slope \(b_k\) captures change in \(Y\) due to variation in \(X_k\) that is orthogonal/uncorrelated to all other columns of \(X\).
\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)
where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:
\(X_{k} = c_0 + c_1 x_{1} + \ldots + c_{j} X_{j}\)
\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\) (if \(X_k^*\) as residual \(X_k\) after regressing on all other \(X_{j \neq k}\)?)
\[E(Y_i | X_i) = f(X)\]
Conditional: mean of \(Y\) is different at different levels of \(X\)
Expectation: expected value/mean of \(Y\)
Function: mathematical function links values of \(X\) to mean of \(Y\). Could be, does not need to be, linear
How does multivariate regression relate back to Conditional Expectation Function?
Least Squares predictions \(\widehat{Y} = X'\beta\) gives the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).
Exercise: Download the data. Contains mean income for each group of people with the same AGE in years
Exercise:
Use matrix calculations in R to obtain the linear approximation of \(E(Y_i | X_i)\):
#With individual data...
X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN
beta = solve(t(X) %*% X) %*% t(X) %*% y
beta## [,1]
## [1,] 17501.992
## [2,] 3668.629
Do your estimates agree? Why or why not?
We are not accounting for the number of people in each value of \(X_i\)
#With aggregate data...
X = cbind(1, cef$AGE)
y = cef$INCEARN
w = cef$respondents
w = w * diag(length(w))
beta = solve(t(X) %*% w %*% X) %*% t(X) %*% w %*% y
beta## [,1]
## [1,] 17501.992
## [2,] 3668.629
We can reproduce any individual-level least squares by:
We are directly modelling the conditional expectation of \(E(Y_i | X_i)\) as a linear function.
Least Squares estimates the mean of \(Y\) as a function of values of \(X\).
What are they?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.
Can be used for variables with many mutually exclusive categories (e.g. race, marital status, etc.)
How do we make them?
R with as.factor(): each unique value will get a dummy.lm(y ~ as.factor(any_variable))
\(Earnings_i = b_0 + b_1 \ Female_i\)
assuming, as the survey does, the gender binary
\(Earnings_i = b_0 + b_1 \ Female_i + \ b_2 Male_i\)
\(Earnings_i = b_0 \ Female_i + \ b_1 Male_i\)
## (Intercept) FEMALE
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 1
## (Intercept) FEMALE
## 201941.70 -57269.02
## (Intercept) FEMALE MALE
## 1 1 0 1
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 0 1
## 6 1 1 0
## (Intercept) FEMALE MALE
## 201941.70 -57269.02 NA
## FEMALE MALE
## 1 0 1
## 2 0 1
## 3 0 1
## 4 1 0
## 5 0 1
## 6 1 0
## FEMALE MALE
## 144672.7 201941.7
If there is an intercept
R will do this by default)If there is no intercept:
How did different regions in the UK vote on Brexit?
brexit data: https://pastebin.com/Y8AmJK7Ctable function to find out how what the different Region names arehist function to visually examine the distribution of the leave vote (Pct_Lev)Using the lm function: (lm(y ~ x, data = data))
Pct_Lev on Regionsummary(your_model))Pct_Lev on -1 + Region##
## East East Midlands London
## 47 40 33
## North East North West Northern Ireland
## 12 39 1
## Scotland South East South West
## 32 67 37
## Wales West Midlands Yorkshire and The Humber
## 22 30 21
| Model 1 | |
|---|---|
| (Intercept) | 39.137 (1.394)*** |
| RegionULondon | −0.045 (1.956) |
| RegionUNorth West | 16.778 (1.881)*** |
| RegionUYorkshire and The Humber | 19.513 (2.215)*** |
| RegionUNorth East | 20.342 (2.669)*** |
| RegionUWest Midlands | 21.178 (2.004)*** |
| RegionUEast Midlands | 20.438 (1.870)*** |
| RegionUSouth West | 14.547 (1.904)*** |
| RegionUEast | 17.825 (1.807)*** |
| RegionUSouth East | 13.033 (1.695)*** |
| RegionUWales | 14.211 (2.184)*** |
| RegionUNorthern Ireland | 5.083 (8.008) |
| Num.Obs. | 381 |
| R2 | 0.443 |
| R2 Adj. | 0.426 |
| AIC | 2668.6 |
| BIC | 2719.9 |
| Log.Lik. | −1321.302 |
| F | 26.679 |
How is this different from what you produced?
During the Brexit vote, many areas thought to support “Remain” experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before polls were open.
Pct_Lev on total_precip_prepolling + Region.total_precip_prepolling on calculated in this case? (hint: what is \(total \ precip \ prepolling^*\))total_precip_prepolling as precisely as possible?| Model 1 | |
|---|---|
| (Intercept) | 52.558 (1.496)*** |
| total_precip_prepolling | 0.052 (0.012)*** |
| RegionEast Midlands | 7.012 (1.927)*** |
| RegionLondon | −19.266 (1.775)*** |
| RegionNorth East | 6.921 (2.677)* |
| RegionNorth West | 3.357 (1.938)+ |
| RegionNorthern Ireland | −8.338 (7.836) |
| RegionScotland | −13.428 (2.021)*** |
| RegionSouth East | −3.112 (1.511)* |
| RegionSouth West | 1.110 (1.957) |
| RegionWales | 0.790 (2.220) |
| RegionWest Midlands | 7.755 (2.052)*** |
| RegionYorkshire and The Humber | 6.091 (2.248)** |
| Num.Obs. | 381 |
| R2 | 0.471 |
| R2 Adj. | 0.454 |
| AIC | 2650.6 |
| BIC | 2705.8 |
| Log.Lik. | −1311.311 |
| F | 27.354 |
\(2\): We regress Percent Leave on the residual pre-election precipitation.
The other variable are dummies for region. Thus, the \(\widehat{precipitation_i}\) is the mean precipitation in each region.
So, residual \(precipitation_i^*\), is difference between precipitation in a consituency and the mean precipitation in the region.
Slope is relationship between within-region precipitation and leave voteshare.
You can change the “reference group” like this. This code shifts “Scotland” to the reference by making it the first “level” in the factor.
levels = c('Scotland', 'London', 'North West',
'Yorkshire and The Humber', 'North East',
'West Midlands', 'East Midlands', 'South West',
'East', 'South East', 'Wales', 'Northern Ireland')
brexit$RegionU = factor(brexit$Region, levels = levels)
brexit = brexit %>% mutate(RegionU = factor(Region, levels = levels))
brexit[, RegionU := factor(Region, levels = levels)]Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.
UHRSWORK (the usual hours worked per week)FEMALE (an indicator for female)AGE (in years)LAW (an indicator for being a lawyer)INCEARN (total annual income earned in dollars).Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304486 -90963 -29238 72845 743690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11083.5 9484.6 1.169 0.243
## UHRSWORK 1075.5 114.7 9.373 <2e-16 ***
## AGE 3452.8 125.2 27.578 <2e-16 ***
## LAW -48919.9 2674.6 -18.291 <2e-16 ***
## FEMALE -37223.1 2799.8 -13.295 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared: 0.146, Adjusted R-squared: 0.1457
## F-statistic: 427.2 on 4 and 9995 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
Previous slide shows linear approximation of CEF of earnings by age.
In your small groups:
Can you think of a way to use least squares to better approximate the true conditional expectation function of earnings by Age?
How would you do it?
Try it out:
How would you incorporate this into the model below?
ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)